Real-time Change Data Capture Using Staging Tables and Delta View Generation for Incremental Loading of Large Dimensions in a Data Warehouse

نویسنده

  • Prabhudev Irabashetti
چکیده

In the big data era, data become more important for Business Intelligence and Enterprise Data Analytics system operation. The load cycle of traditional data warehouse is fixed and longer, which can’t timely response the rapid and real time data change. Real-time data warehouse technology as an extension of traditional data warehouse can capture the rapid data change and process the real-time data analysis to meet the requirement of Business Intelligence and Enterprise Data Analytics system. The real-time data access without processing delay is a challenging task to the real-time data warehouse. In this paper we discusses current CDC technologies and presents the theory about why they are unable to deliver changes in real-time. This paper also explain the approaches of dimension delta view generation of incremental loading of real-time data and staging table ETL framework to process the historical data and real-time data separately. Incremental load is the preferred approach in efficient ETL processes. Delta view and stage table framework for a dimension encompasses all its source table and produces a set of keys that should be incrementally processed. We have employed this approach in real world project and have noticed an effectiveness of real-time data ETL and reduction in the loading time of big dimension. INTRODUCTION Data transformations are often the most complex and, in terms of processing time, the most costly part of the extraction, transformation, and loading (ETL) process. Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. This is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse. They can range from simple data conversions to extremely complex data scrubbing techniques. The data acquisition process represents one of the most technically demanding and one of the most expensive parts of building a data warehouse. The data needed for the data warehouse are being extracted from their sources during an extract, transfer and load (ETL) process. Typical data warehouse architecture includes a data staging area where the extracted data are stored temporarily, and subsequently, upon cleaning and transformation phase, transferred into the data warehouse. Usually, the traditional data warehouse only supports the historical information query and analysis, which cannot obtain the upto-date real-time data. After the first or initial loading, data warehouse must be periodically refreshed to keep up with source data updates. Data warehouse loading is usually done periodically, by a background process. The update patterns (daily, weekly, etc.) for traditional data warehouses and data integration process result in outdated data to a greater or lesser extent. The naive approach to refreshing a data warehouse is referred to as full reloading. Possible refreshment scenario is to repeat the initial loading process using modified source data, compare the results with the current DW in order to determine changes that need to be done and finally perform the changes. This strategy is known as full DW reload. With the increasing size and complexity of DW, full reloading becomes inadequate, and in some cases inapplicable. More appropriate approach is a gradual change of DW in accordance with the changes that have occurred in the data sources since the last synchronization. Only the data that has changed since the previous reload needs to be transformed and processed. This approach is known as incremental reloading. Incremental reload work quicker and efficiently than full reload. The basic of incremental and real-time loading is that the changes of source data can be captured and later on propagated to the data warehouse. There is a number of known techniques for changed data capture. Novateur Publication’s International Journal of Innovation in Engineering, Research and Technology [IJIERT] ICITDCEME’15 Conference Proceedings

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

افزایش سرعت نگهداری افزایشی دید با استفاده از الگوریتم فاخته

Data warehouse is a repository of integrated data that is collected from various sources. Data warehouse has a capability of maintaining data from various sources in its view form. So, the view should be maintained and updated during changes of sources. Since the increase in updates may cause costly overhead, it is necessary to update views with high accuracy. Optimal Delta Evaluation method is...

متن کامل

Incremental ETL Pipeline Scheduling for Near Real-Time Data Warehouses

We present our work based on an incremental ETL pipeline for on-demand data warehouse maintenance. Pipeline parallelism is exploited to concurrently execute a chain of maintenance jobs, each of which takes a batch of delta tuples extracted from source-local transactions with commit timestamps preceding the arrival time of an incoming warehouse query and calculates Ąnal deltas to bring relevant ...

متن کامل

Formalizing ETL Jobs for Incremental Loading of Data Warehouses

Extract-transform-load (ETL) tools are primarily designed for data warehouse loading, i.e. to perform physical data integration. When the operational data sources happen to change, the data warehouse gets stale. To ensure data timeliness, the data warehouse is refreshed on a periodical basis. The naive approach of simply reloading the data warehouse is obviously inefficient. Typically, only a s...

متن کامل

Improvement of the Analytical Queries Response Time in Real-Time Data Warehouse using Materialized Views Concatenation

A real-time data warehouse is a collection of recent and hierarchical data that is used for managers’ decision-making by creating online analytical queries. The volume of data collected from data sources and entered into the real-time data warehouse is constantly increasing. Moreover, as the volume of input data to the real time data warehouse increases, the interference between online loading ...

متن کامل

بهبود الگوریتم انتخاب دید در پایگاه داده‌‌ تحلیلی با استفاده از یافتن پرس‌ وجوهای پرتکرار

A data warehouse is a source for storing historical data to support decision making. Usually analytic queries take much time. To solve response time problem it should be materialized some views to answer all queries in minimum response time. There are many solutions for view selection problems. The most appropriate solution for view selection is materializing frequent queries. Previously posed ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016